Predicting Neurodegenerative Diseases

This project was made as a part of the Data Insight Program of 2020

Author : Omar Ossama Mahmoud Ahmed
ID #: 87

Introduction to Parkinson's

Parkinson's disease, or simply Parkinson's, is a long-term degenerative disorder of the central nervous system that mainly affects the motor system. As the disease worsens, non-motor symptoms become more common. The symptoms usually emerge slowly. Early in the disease, the most obvious symptoms are shaking, rigidity, slowness of movement, and difficulty with walking.Thinking and behavioral problems may also occur. Dementia becomes common in the advanced stages of the disease.Depression and anxiety are also common, occurring in more than a third of people with PD. Other symptoms include sensory, sleep, and emotional problems. The main motor symptoms are collectively called "parkinsonism", or a "parkinsonian syndrome".

The cause of Parkinson's disease is unknown, but is believed to involve both genetic and environmental factors. Those with a family member affected are more likely to get the disease themselves.There is also an increased risk in people exposed to certain pesticides and among those who have had prior head injuries, while there is a reduced risk in tobacco smokers and those who drink coffee or tea. The motor symptoms of the disease result from the death of cells in the substantia nigra, a region of the midbrain. This results in not enough dopamine in this region of the brain.The cause of this cell death is poorly understood, but it involves the build-up of proteins into Lewy bodies in the neurons. Diagnosis of typical cases is mainly based on symptoms, with tests such as neuroimaging used to rule out other diseases.

There is no cure for Parkinson's disease

SOURCES:

Dataset

  • Parkinsons Data Set Data Repository by UC Irvine Machine Learning Repository
    • This dataset is updated by UC Irvine Machine Learning Repository

Table of Content :


  • Introduction to Parkinsons.
  • Installing Libraries.
  • Imports and Datasets.
  • Defining Functions Used.
  • Importing Datasets from Local files.
  • Exploring Imported Datasets.
  • Preprocessing of Datasets.
    • Cleaning Data.
  • Exploratory data analysis EDA.
    • Pandas Profiling.
    • Correlation Analysis.
    • Questions Raised.
    • Exploring most contributing features to status.
    • Exploring PPE Feature.
    • Exploring MDVP:Fo(Hz) Feature.
    • Exploring HNR Feature.
    • Exploring Relationships that produce the most differentiable status.
    • MDVP :Fo(Hz) Vs HNR.
    • MDVP :Fo(Hz) Vs PPE.
  • Machine Learning Exploration
    • Exploring Machine Learning Approach to Predict Paitent Status.
    • Importing Libraries used.
    • Preprocessing Data.
    • Logistic Regression Model.
    • SVM Model.
    • DecisionTreeClassifier Model.
    • Random Forest Classifier Model.
    • AdaBoost Classifier Model.
    • XGBoost Classifier Model.
    • Ensemble Voting Classifier Model.
    • Results.
  • Conclusion

Installing Libraries


In [1]:
pip install pandas-profiling
D:\DATASCIENCE>doskey tmc=java -jar E:\tmc-cli-0.9.2.jar $* 
Requirement already satisfied: pandas-profiling in c:\programdata\anaconda3\lib\site-packages (2.8.0)
Requirement already satisfied: missingno>=0.4.2 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (0.4.2)
Requirement already satisfied: tqdm>=4.43.0 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (4.47.0)
Requirement already satisfied: phik>=0.9.10 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (0.10.0)
Requirement already satisfied: astropy>=4.0 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (4.0.1.post1)
Requirement already satisfied: tangled-up-in-unicode>=0.0.6 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (0.0.6)
Requirement already satisfied: scipy>=1.4.1 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (1.5.0)
Requirement already satisfied: jinja2>=2.11.1 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (2.11.2)
Requirement already satisfied: visions[type_image_path]==0.4.4 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (0.4.4)
Requirement already satisfied: joblib in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (0.16.0)
Requirement already satisfied: confuse>=1.0.0 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (1.3.0)
Requirement already satisfied: requests>=2.23.0 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (2.24.0)
Requirement already satisfied: matplotlib>=3.2.0 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (3.2.2)
Requirement already satisfied: numpy>=1.16.0 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (1.18.5)
Requirement already satisfied: ipywidgets>=7.5.1 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (7.5.1)
Requirement already satisfied: pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (1.0.5)
Requirement already satisfied: htmlmin>=0.1.12 in c:\programdata\anaconda3\lib\site-packages (from pandas-profiling) (0.1.12)
Requirement already satisfied: seaborn in c:\programdata\anaconda3\lib\site-packages (from missingno>=0.4.2->pandas-profiling) (0.10.1)
Requirement already satisfied: numba>=0.38.1 in c:\programdata\anaconda3\lib\site-packages (from phik>=0.9.10->pandas-profiling) (0.50.1)
Requirement already satisfied: MarkupSafe>=0.23 in c:\programdata\anaconda3\lib\site-packages (from jinja2>=2.11.1->pandas-profiling) (1.1.1)
Requirement already satisfied: networkx>=2.4 in c:\programdata\anaconda3\lib\site-packages (from visions[type_image_path]==0.4.4->pandas-profiling) (2.4)
Requirement already satisfied: attrs>=19.3.0 in c:\programdata\anaconda3\lib\site-packages (from visions[type_image_path]==0.4.4->pandas-profiling) (19.3.0)
Requirement already satisfied: Pillow; extra == "type_image_path" in c:\programdata\anaconda3\lib\site-packages (from visions[type_image_path]==0.4.4->pandas-profiling) (7.2.0)
Requirement already satisfied: imagehash; extra == "type_image_path" in c:\programdata\anaconda3\lib\site-packages (from visions[type_image_path]==0.4.4->pandas-profiling) (4.1.0)
Requirement already satisfied: pyyaml in c:\programdata\anaconda3\lib\site-packages (from confuse>=1.0.0->pandas-profiling) (5.3.1)
Requirement already satisfied: idna<3,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.23.0->pandas-profiling) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.23.0->pandas-profiling) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.23.0->pandas-profiling) (1.25.9)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.23.0->pandas-profiling) (2020.6.20)
Requirement already satisfied: python-dateutil>=2.1 in c:\programdata\anaconda3\lib\site-packages (from matplotlib>=3.2.0->pandas-profiling) (2.8.1)
Requirement already satisfied: cycler>=0.10 in c:\programdata\anaconda3\lib\site-packages (from matplotlib>=3.2.0->pandas-profiling) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\programdata\anaconda3\lib\site-packages (from matplotlib>=3.2.0->pandas-profiling) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\programdata\anaconda3\lib\site-packages (from matplotlib>=3.2.0->pandas-profiling) (2.4.7)
Requirement already satisfied: traitlets>=4.3.1 in c:\programdata\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling) (4.3.3)
Requirement already satisfied: ipython>=4.0.0; python_version >= "3.3" in c:\programdata\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling) (7.16.1)
Requirement already satisfied: ipykernel>=4.5.1 in c:\programdata\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling) (5.3.0)
Requirement already satisfied: nbformat>=4.2.0 in c:\programdata\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling) (5.0.7)
Requirement already satisfied: widgetsnbextension~=3.5.0 in c:\programdata\anaconda3\lib\site-packages (from ipywidgets>=7.5.1->pandas-profiling) (3.5.1)
Requirement already satisfied: pytz>=2017.2 in c:\programdata\anaconda3\lib\site-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3->pandas-profiling) (2020.1)
Requirement already satisfied: llvmlite<0.34,>=0.33.0.dev0 in c:\programdata\anaconda3\lib\site-packages (from numba>=0.38.1->phik>=0.9.10->pandas-profiling) (0.33.0+1.g022ab0f)
Requirement already satisfied: setuptools in c:\programdata\anaconda3\lib\site-packages (from numba>=0.38.1->phik>=0.9.10->pandas-profiling) (47.3.1.post20200622)
Requirement already satisfied: decorator>=4.3.0 in c:\programdata\anaconda3\lib\site-packages (from networkx>=2.4->visions[type_image_path]==0.4.4->pandas-profiling) (4.4.2)
Requirement already satisfied: six in c:\users\omar ossama\appdata\roaming\python\python37\site-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.4.4->pandas-profiling) (1.12.0)
Requirement already satisfied: PyWavelets in c:\programdata\anaconda3\lib\site-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.4.4->pandas-profiling) (1.1.1)
Requirement already satisfied: ipython-genutils in c:\programdata\anaconda3\lib\site-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling) (0.2.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in c:\programdata\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling) (3.0.5)
Requirement already satisfied: jedi>=0.10 in c:\programdata\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling) (0.17.1)
Requirement already satisfied: pygments in c:\programdata\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling) (2.6.1)
Requirement already satisfied: colorama; sys_platform == "win32" in c:\users\omar ossama\appdata\roaming\python\python37\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling) (0.4.1)
Requirement already satisfied: backcall in c:\programdata\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling) (0.2.0)
Requirement already satisfied: pickleshare in c:\programdata\anaconda3\lib\site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling) (0.7.5)
Requirement already satisfied: tornado>=4.2 in c:\programdata\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.5.1->pandas-profiling) (6.0.4)
Requirement already satisfied: jupyter-client in c:\programdata\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.5.1->pandas-profiling) (6.1.5)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in c:\programdata\anaconda3\lib\site-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling) (3.2.0)
Requirement already satisfied: jupyter-core in c:\programdata\anaconda3\lib\site-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling) (4.6.3)
Requirement already satisfied: notebook>=4.4.1 in c:\programdata\anaconda3\lib\site-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (6.0.3)
Requirement already satisfied: wcwidth in c:\programdata\anaconda3\lib\site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling) (0.2.5)
Requirement already satisfied: parso<0.8.0,>=0.7.0 in c:\programdata\anaconda3\lib\site-packages (from jedi>=0.10->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling) (0.7.0)
Requirement already satisfied: pyzmq>=13 in c:\programdata\anaconda3\lib\site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets>=7.5.1->pandas-profiling) (19.0.1)
Requirement already satisfied: pyrsistent>=0.14.0 in c:\programdata\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling) (0.16.0)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in c:\programdata\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling) (1.7.0)
Requirement already satisfied: pywin32>=1.0; sys_platform == "win32" in c:\programdata\anaconda3\lib\site-packages (from jupyter-core->nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling) (227)
Requirement already satisfied: nbconvert in c:\programdata\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (5.6.1)
Note: you may need to restart the kernel to use updated packages.Requirement already satisfied: prometheus-client in c:\programdata\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (0.8.0)
Requirement already satisfied: terminado>=0.8.1 in c:\programdata\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (0.8.3)
Requirement already satisfied: Send2Trash in c:\programdata\anaconda3\lib\site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (1.5.0)

Requirement already satisfied: zipp>=0.5 in c:\programdata\anaconda3\lib\site-packages (from importlib-metadata; python_version < "3.8"->jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling) (3.1.0)
Requirement already satisfied: mistune<2,>=0.8.1 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (0.8.4)
Requirement already satisfied: entrypoints>=0.2.2 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (0.3)
Requirement already satisfied: testpath in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (0.4.4)
Requirement already satisfied: defusedxml in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (0.6.0)
Requirement already satisfied: bleach in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (3.1.5)
Requirement already satisfied: pandocfilters>=1.4.1 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (1.4.2)
Requirement already satisfied: packaging in c:\programdata\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (20.4)
Requirement already satisfied: webencodings in c:\programdata\anaconda3\lib\site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling) (0.5.1)
In [2]:
pip install empiricaldist;
D:\DATASCIENCE>doskey tmc=java -jar E:\tmc-cli-0.9.2.jar $* 
Requirement already satisfied: empiricaldist in c:\programdata\anaconda3\lib\site-packages (0.3.5)
Note: you may need to restart the kernel to use updated packages.
In [3]:
pip install plotly_express;
D:\DATASCIENCE>doskey tmc=java -jar E:\tmc-cli-0.9.2.jar $* 
Requirement already satisfied: plotly_express in c:\programdata\anaconda3\lib\site-packages (0.4.1)
Requirement already satisfied: patsy>=0.5 in c:\programdata\anaconda3\lib\site-packages (from plotly_express) (0.5.1)
Requirement already satisfied: statsmodels>=0.9.0 in c:\programdata\anaconda3\lib\site-packages (from plotly_express) (0.11.1)
Requirement already satisfied: plotly>=4.1.0 in c:\programdata\anaconda3\lib\site-packages (from plotly_express) (4.6.0)
Requirement already satisfied: numpy>=1.11 in c:\programdata\anaconda3\lib\site-packages (from plotly_express) (1.18.5)
Requirement already satisfied: scipy>=0.18 in c:\programdata\anaconda3\lib\site-packages (from plotly_express) (1.5.0)
Requirement already satisfied: pandas>=0.20.0 in c:\programdata\anaconda3\lib\site-packages (from plotly_express) (1.0.5)
Requirement already satisfied: six in c:\users\omar ossama\appdata\roaming\python\python37\site-packages (from patsy>=0.5->plotly_express) (1.12.0)
Requirement already satisfied: retrying>=1.3.3 in c:\programdata\anaconda3\lib\site-packages (from plotly>=4.1.0->plotly_express) (1.3.3)
Requirement already satisfied: python-dateutil>=2.6.1 in c:\programdata\anaconda3\lib\site-packages (from pandas>=0.20.0->plotly_express) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in c:\programdata\anaconda3\lib\site-packages (from pandas>=0.20.0->plotly_express) (2020.1)
Note: you may need to restart the kernel to use updated packages.

Imports and Datasets


  • Pandas : for dataset handeling
  • Numpy : Support for Pandas and calculations
  • Datetime: for date and times calculations
  • Math : for mathimatical operations
  • Matplotlib : for visualization (basic)
  • Empiricaldist : for statistical analysis
  • Seaborn : for visualization and plotting (Presentable)
  • pycountry : Library for getting continent (name) to from their country names
  • plotly : for interative plots
In [3]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import math
import pycountry
import pycountry_convert as pc
from plotly.subplots import make_subplots
import plotly_express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import empiricaldist as emp
%matplotlib inline
sns.set_style('darkgrid')

Defining Functions


  • ecdf() : for CDF calculation
  • ecdf_plot() : for CDF plotting
In [4]:
# Defining ecdf Function for Cumilative Distribution Function
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    #credits DataCamp Justin Bois
    
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y
In [5]:
# Defining ecdf plotting function between two variables
def ecdf_plot(c1,c2,t1='First set',t2='Second set'):
    """Plot ECDF for a one-dimensional array of measurements between two variables."""
    #Using ecdf to compute the CDF
    x1,y1 = list(ecdf(c1))
    x,y = list(ecdf(c2))

    #Create a subplot to fit two axis
    fig = make_subplots(rows=1, cols=2,subplot_titles=(f'CDF of {t1}', f'CDF of {t2}'))

    #add first plot of Budget
    fig.add_trace(
      go.Scatter(x= x1,y = y1,name = f'CDF of {t1}'),
      row=1, col=1
    )

    #add second plot of Revenue
    fig.add_trace(
      go.Scatter(x = x,y = y,name = f'CDF of {t2}'),
      row=1, col=2
    )

    #control title and figure dimentions
    fig.update_layout(height=500, width=1000, title_text="Cumulative distribution functions")
    fig.show()

Dataset Used


Parkinsons Data Set by UC Irvine Machine Learning Repository (LINK)


Source:

The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders.

Data Set Information:

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recording from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to "status" column which is set to 0 for healthy and 1 for PD.

The data is in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. There are around six recordings per patient, the name of the patient is identified in the first column.For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

Attribute Information:

  • Matrix column entries (attributes):
    • name - ASCII subject name and recording number
    • MDVP:Fo(Hz) - Average vocal fundamental frequency
    • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
    • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
    • MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
    • MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
    • NHR,HNR - Two measures of ratio of noise to tonal components in the voice
    • status - Health status of the subject (one) - Parkinson's, (zero) - healthy
    • RPDE,D2 - Two nonlinear dynamical complexity measures
    • DFA - Signal fractal scaling exponent spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

Importing Datasets from Local files


Parkinsons Data Set from UC Irvine Machine Learning Repository

  • parkinson's.csv : Parkinsons Data Set from UC Irvine Machine Learning Repository

In [6]:
#Importing datasets of Parkinson's
df = pd.read_csv('D:\DATASCIENCE\Project 3\Dataset\\parkinsons.csv')
In [7]:
display(df.head())
display(df.describe())
display(df.info())
name MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer ... Shimmer:DDA NHR HNR status RPDE DFA spread1 spread2 D2 PPE
0 phon_R01_S01_1 119.992 157.302 74.997 0.00784 0.00007 0.00370 0.00554 0.01109 0.04374 ... 0.06545 0.02211 21.033 1 0.414783 0.815285 -4.813031 0.266482 2.301442 0.284654
1 phon_R01_S01_2 122.400 148.650 113.819 0.00968 0.00008 0.00465 0.00696 0.01394 0.06134 ... 0.09403 0.01929 19.085 1 0.458359 0.819521 -4.075192 0.335590 2.486855 0.368674
2 phon_R01_S01_3 116.682 131.111 111.555 0.01050 0.00009 0.00544 0.00781 0.01633 0.05233 ... 0.08270 0.01309 20.651 1 0.429895 0.825288 -4.443179 0.311173 2.342259 0.332634
3 phon_R01_S01_4 116.676 137.871 111.366 0.00997 0.00009 0.00502 0.00698 0.01505 0.05492 ... 0.08771 0.01353 20.644 1 0.434969 0.819235 -4.117501 0.334147 2.405554 0.368975
4 phon_R01_S01_5 116.014 141.781 110.655 0.01284 0.00011 0.00655 0.00908 0.01966 0.06425 ... 0.10470 0.01767 19.649 1 0.417356 0.823484 -3.747787 0.234513 2.332180 0.410335

5 rows × 24 columns

MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer MDVP:Shimmer(dB) ... Shimmer:DDA NHR HNR status RPDE DFA spread1 spread2 D2 PPE
count 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 ... 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000
mean 154.228641 197.104918 116.324631 0.006220 0.000044 0.003306 0.003446 0.009920 0.029709 0.282251 ... 0.046993 0.024847 21.885974 0.753846 0.498536 0.718099 -5.684397 0.226510 2.381826 0.206552
std 41.390065 91.491548 43.521413 0.004848 0.000035 0.002968 0.002759 0.008903 0.018857 0.194877 ... 0.030459 0.040418 4.425764 0.431878 0.103942 0.055336 1.090208 0.083406 0.382799 0.090119
min 88.333000 102.145000 65.476000 0.001680 0.000007 0.000680 0.000920 0.002040 0.009540 0.085000 ... 0.013640 0.000650 8.441000 0.000000 0.256570 0.574282 -7.964984 0.006274 1.423287 0.044539
25% 117.572000 134.862500 84.291000 0.003460 0.000020 0.001660 0.001860 0.004985 0.016505 0.148500 ... 0.024735 0.005925 19.198000 1.000000 0.421306 0.674758 -6.450096 0.174351 2.099125 0.137451
50% 148.790000 175.829000 104.315000 0.004940 0.000030 0.002500 0.002690 0.007490 0.022970 0.221000 ... 0.038360 0.011660 22.085000 1.000000 0.495954 0.722254 -5.720868 0.218885 2.361532 0.194052
75% 182.769000 224.205500 140.018500 0.007365 0.000060 0.003835 0.003955 0.011505 0.037885 0.350000 ... 0.060795 0.025640 25.075500 1.000000 0.587562 0.761881 -5.046192 0.279234 2.636456 0.252980
max 260.105000 592.030000 239.170000 0.033160 0.000260 0.021440 0.019580 0.064330 0.119080 1.302000 ... 0.169420 0.314820 33.047000 1.000000 0.685151 0.825288 -2.434031 0.450493 3.671155 0.527367

8 rows × 23 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 17  status            195 non-null    int64  
 18  RPDE              195 non-null    float64
 19  DFA               195 non-null    float64
 20  spread1           195 non-null    float64
 21  spread2           195 non-null    float64
 22  D2                195 non-null    float64
 23  PPE               195 non-null    float64
dtypes: float64(22), int64(1), object(1)
memory usage: 36.7+ KB
None

Cleaning Data


After quick overview of the data:

Data does seem clean to proceed to EDA Phase with minimal to no cleaning, with no missing data or unmatching data types.

Preferably cleaning the 'name' column to discard of 'phonR01' prefix to further simplify the data.

As well as changing Status type from int to categorical variable.

In [8]:
# Changing Status type to categorical
df.status = df.status.astype('category')
In [9]:
# Simplifying name column
df.replace(to_replace ='phon_R01_', value = '', regex = True, inplace = True)

Exploratory Data Analysis (EDA)

Correlation using Spearman Correlation

In [10]:
ax = sns.heatmap(df.corr(method='spearman'))
plt.title('Correlation between features Using Spearman Method');

General Observations:

  • A quick look through the correlation matrix suggests a high correlation between several columns with same product of measurement.
  • The Kay Pentax multidimensional voice program (MDVP) and their componants of measurements are highly correlated which is to be expected.
  • Harmonics to Noise Ratio (HNR) is negatively correlated with MDVP componants.
  • status is most correlated with spread 1 and spread 2 as well as MDVP APQ.

Questions Raised:

  • From previous analysis and visually exploring data some questions arised:-

    1- Which of the 22 features are the most contributing to the status of the patient?

    2- Which relationship between two features would produce a distinction between status 1 and 0?


Exploring most contributing features to status.

Correlation diagram suggests thet PPE and MDVP:Fo and HNR are the most contributing features to status with either negative or positive correlation value.


Exploring PPE feature

In [11]:
PPE_1 = df[df.status == 1]['PPE']
PPE_0 = df[df.status == 0]['PPE']

fig = ecdf_plot(PPE_0,PPE_1,'PPE for Negative Patients','PPE for Positive Patients')
In [12]:
fig = px.histogram(df, x = 'PPE', color= df.status, marginal = "box", title = 'Histogram of PPE Distribution over both Status')

# Overlay both histograms
fig.update_layout(barmode='overlay')

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
Observations:
  • CDF shows distribution of values throughout the feature with respect to the status of the patient which shows significant difference between both distributions.

  • Histogram shows another perspective of the distribution of the feature with PPE higher than 0.3 signifies positive status.

  • Box plot shows a clearer picture of the feature at question showing values higher than 0.21 are most probable to be positive.


Exploring MDVP:Fo feature

In [13]:
PPE_1 = df[df.status == 1]['MDVP:Fo(Hz)']
PPE_0 = df[df.status == 0]['MDVP:Fo(Hz)']

ecdf_plot(PPE_0,PPE_1,'MDVP:Fo(Hz) for Negative Patients','MDVP:Fo(Hz) for Positive Patients')
In [14]:
fig = px.histogram(df, x = 'MDVP:Fo(Hz)', color= df.status, marginal = "box", title = 'Histogram of MDVP:Fo Distribution over both Status')

# Overlay both histograms
fig.update_layout(barmode='overlay')

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
Observations:
  • CDF shows distribution of values throughout the feature with respect to the status of the patient which shows significant difference between both distributions specially in values lower than 110 Hz.

  • Histogram shows another perspective of the distribution of the feature with Fo (Hz) higher than 220 Hz signifies Negative status.

  • Box plot shows a clearer picture of the feature at question showing values lower than 120 Hz are most probable to be positive.


Exploring HNR feature

In [15]:
PPE_1 = df[df.status == 1]['HNR']
PPE_0 = df[df.status == 0]['HNR']

ecdf_plot(PPE_0,PPE_1,'HNR for Negative Patients','HNR for Positive Patients')
In [16]:
fig = px.histogram(df, x = 'HNR', color= df.status, marginal = "box", title = 'Histogram of HNR Distribution over both Status')

# Overlay both histograms
fig.update_layout(barmode='overlay')

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
Observations:
  • CDF shows High difference between the two variables at higher percentiles and lower ones, with mid range having similarities.

  • Histogram shows overlapping statuses in midrange 17 and 30 yet significant difference between both status at values lower than 18.

  • Box plot shows a clearer picture of the feature at question showing values higher than 30 are most probable to be negative while values lower than 17 are most probably positive.


Exploring Relationships that produce the most differentiable status.

Plotting a scatter plot matrix between all features to visually distinct which features could produce the most clear differentiable statuses.

Also taking into account correlation matrix as deduced above and their EDA.

In [19]:
sns.pairplot(df, hue="status");

Deduced from the above figure that MDVP:Fo(HZ) has the most differentiable measurement more over when correlated with HNR and PPE results are very interesting.

MDVP :Fo(Hz) Vs HNR


In [17]:
fig = px.scatter(df, x ='HNR',y = 'MDVP:Fo(Hz)', color = 'status', title = 'Relationship between MDVP:Fo & HNR wrt Status')
fig
Observations:
  • High values of Fo (Hz) - > 223 Hz - correlates with negative status.
  • HNR values lower than 17 correlates with positive status regardless of Fo.
  • Midrange Fo between 174.1 and 129 correlates with positive status regardless of HNR value.

MDVP :Fo(Hz) Vs PPE


In [18]:
fig = px.scatter(df, x ='PPE',y = 'MDVP:Fo(Hz)', color = 'status', title = 'Relationship between MDVP:Fo & PPE wrt Status')
fig
Observations:
  • PPE values higher than 0.25 correlates with positive status.
  • Positive status PPE values lies between 0.09 and 0.5 with values higher than 0.25 correlates with positive status.
  • Both statuses are very distinct in this relationship.

Exploring Machine Learning Approach to Predict Paitent Status

Importing Libraries used

In [21]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler 
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, roc_curve, auc
import xgboost as xgb

Preprocessing Data

Defining Target for the model and features

In [22]:
Y = df['status'].values # Target for the model
X = df[['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)','MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP','MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5','MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA','spread1', 'spread2', 'D2', 'PPE']]
In [23]:
# Spliting data to test and train sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42)
In [24]:
# Normalizing Data using MinMaxScaler
scaler = MinMaxScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Defining list to store Performance Metrics

In [25]:
performance = [] # list to store all performance metric

Logistic Regression Model

In [26]:
lr_best_score = 0
lr_kfolds = 5 # set the number of folds


# Finding the Best lr Model 
for c in [0.001, 0.1, 1, 4, 10, 100]:
    logRegModel = LogisticRegression(C=c)
    
    # perform cross-validation
    scores = cross_val_score(logRegModel, X_train, Y_train, cv = lr_kfolds, scoring = 'accuracy') # Get recall for each parameter setting
    
    # compute mean cross-validation accuracy
    score = np.mean(scores)
    
    # Find the best parameters and score
    if score > lr_best_score:
        lr_best_score = score
        lr_best_parameters = c

# rebuild a model on the combined training and validation set
SelectedLogRegModel = LogisticRegression(C = lr_best_parameters).fit(X_train_scaled, Y_train)

# Model Test
lr_test_score = SelectedLogRegModel.score(X_test_scaled, Y_test)

# Predicted Output of Model
PredictedOutput = SelectedLogRegModel.predict(X_test_scaled)

# Extracting sensitivity & specificity from ROC curve to measure Performance of the model
lr_fpr, lr_tpr, lr_thresholds = roc_curve(Y_test, PredictedOutput, pos_label=1)

# Using AUC of ROC to validate models based on a single score
lr_test_auc = auc(lr_fpr, lr_tpr)


# Output Printing Scores
print("Best accuracy on validation set is:", lr_best_score)

print("Best parameter for regularization (C) is: ", lr_best_parameters)

print("Test accuracy with best C parameter is", lr_test_score)
print("Test AUC with the best C parameter is", lr_test_auc)

# Appending results to performance list
m = 'Logistic Regression'
performance.append([m, lr_test_score, lr_test_auc, lr_fpr, lr_tpr, lr_thresholds])
Best accuracy on validation set is: 0.8427586206896553
Best parameter for regularization (C) is:  4
Test accuracy with best C parameter is 0.8979591836734694
Test AUC with the best C parameter is 0.7727272727272727

SVM Model

In [27]:
svm_best_score = 0
svm_kfolds = 5


for c_paramter in [0.001, 0.01, 0.1,6, 10, 100, 1000]: #iterate over the values we need to try for the parameter C
    for gamma_paramter in [0.001, 0.01, 0.1,5, 10, 100, 1000]: #iterate over the values we need to try for the parameter gamma
        for k_parameter in ['rbf', 'linear', 'poly', 'sigmoid']: # iterate over the values we need to try for the kernel parameter
            
            svmModel = SVC(kernel=k_parameter, C=c_paramter, gamma=gamma_paramter) #define the model
            # perform cross-validation
            
            scores = cross_val_score(svmModel, X_train_scaled, Y_train, cv = svm_kfolds, scoring='accuracy')
            # the training set will be split internally into training and cross validation

            # compute mean cross-validation accuracy
            score = np.mean(scores)
            
            # if we got a better score, store the score and parameters           
            if score > svm_best_score:
                svm_best_score = score #store the score 
                svm_best_parameter_c = c_paramter #store the parameter c
                svm_best_parameter_gamma = gamma_paramter #store the parameter gamma
                svm_best_parameter_k = k_parameter
            

# rebuild a model with best parameters to get score 
SelectedSVMmodel = SVC(C = svm_best_parameter_c, gamma = svm_best_parameter_gamma, kernel = svm_best_parameter_k).fit(X_train_scaled, Y_train)

# Model Test
svm_test_score = SelectedSVMmodel.score(X_test_scaled, Y_test)

# Predicted Output of Model
PredictedOutput = SelectedSVMmodel.predict(X_test_scaled)

# Extracting sensitivity & specificity from ROC curve to measure Performance of the model
svm_fpr, svm_tpr, svm_thresholds = roc_curve(Y_test, PredictedOutput, pos_label=1)

# Using AUC of ROC to validate models based on a single score
svm_test_auc = auc(svm_fpr, svm_tpr)


# Output Printing Scores
print("Best accuracy on cross validation set is:", svm_best_score)

print("Best parameter for c is: ", svm_best_parameter_c)
print("Best parameter for gamma is: ", svm_best_parameter_gamma)
print("Best parameter for kernel is: ", svm_best_parameter_k)

print("Test accuracy with the best parameters is", svm_test_score)
print("Test AUC with the best parameter is", svm_test_auc)



# Appending results to performance list
m = 'SVM'
performance.append([m, svm_test_score, svm_test_auc, svm_fpr, svm_tpr, svm_thresholds])
Best accuracy on cross validation set is: 0.9448275862068967
Best parameter for c is:  6
Best parameter for gamma is:  5
Best parameter for kernel is:  rbf
Test accuracy with the best parameters is 0.9387755102040817
Test AUC with the best parameter is 0.8636363636363636

DecisionTreeClassifier Model

In [28]:
dt_best_score = 0
dt_kfolds = 10


for md in range(1, 9): # iterate different maximum depth values
    # train the model
    treeModel = DecisionTreeClassifier(random_state=0, max_depth=md, criterion='gini')
    
    # perform cross-validation
    scores = cross_val_score(treeModel, X_train_scaled, Y_train, cv = dt_kfolds, scoring='accuracy')
    
    # compute mean cross-validation accuracy
    score = np.mean(scores)
    
    # if we got a better score, store the score and parameters
    if score > dt_best_score:
        dt_best_score = score
        dt_best_parameter = md

        
# Rebuild a model on the combined training and validation set        
SelectedDTModel = DecisionTreeClassifier(max_depth = dt_best_parameter).fit(X_train_scaled, Y_train )

# Model Test
dt_test_score = SelectedDTModel.score(X_test_scaled, Y_test)

# Predicted Output of Model
PredictedOutput = SelectedDTModel.predict(X_test_scaled)

# Extracting sensitivity & specificity from ROC curve to measure Performance of the model
dt_fpr, dt_tpr, dt_thresholds = roc_curve(Y_test, PredictedOutput, pos_label=1)

# Using AUC of ROC to validate models based on a single score
dt_test_auc = auc(dt_fpr, dt_tpr)


# Output Printing Scores
print("Best accuracy on validation set is:", dt_best_score)

print("Best parameter for the maximum depth is: ", dt_best_parameter)

print("Test accuracy with best parameter is ", dt_test_score)
print("Test AUC with the best parameter is ", dt_test_auc)


# Appending results to performance list
m = 'Decision Tree'
performance.append([m, dt_test_score, dt_test_auc, dt_fpr, dt_tpr, dt_thresholds])
Best accuracy on validation set is: 0.8642857142857144
Best parameter for the maximum depth is:  5
Test accuracy with best parameter is  0.8979591836734694
Test AUC with the best parameter is  0.8050239234449762

Features Importance

In [29]:
print("Feature importance: ")
features = np.array([X.columns.values.tolist(), list(SelectedDTModel.feature_importances_)]).T

for i in features:
    print(f'{i[0]} : {i[1]}')
Feature importance: 
MDVP:Fo(Hz) : 0.2789793597522502
MDVP:Fhi(Hz) : 0.0
MDVP:Flo(Hz) : 0.0
MDVP:Jitter(%) : 0.0
MDVP:Jitter(Abs) : 0.06478134339071083
MDVP:RAP : 0.0
MDVP:PPQ : 0.0
Jitter:DDP : 0.07963422307864516
MDVP:Shimmer : 0.0
MDVP:Shimmer(dB) : 0.0
Shimmer:APQ3 : 0.0
Shimmer:APQ5 : 0.0
MDVP:APQ : 0.0
Shimmer:DDA : 0.0
NHR : 0.0
HNR : 0.13032482023307707
RPDE : 0.05792214232581203
DFA : 0.0
spread1 : 0.0
spread2 : 0.0
D2 : 0.03439127200595089
PPE : 0.3539668392135539

Random Forest Classifier Model

In [30]:
rf_best_score = 0
rf_kfolds = 5


for M in range(2, 15, 2): # combines M trees
    for d in range(1, 9): # maximum number of features considered at each split
        for m in range(1, 9): # maximum depth of the tree
            # train the model
            # n_jobs(4) is the number of parallel computing
            forestModel = RandomForestClassifier(n_estimators = M, max_features = d, n_jobs = 4,max_depth = m, random_state = 0)
        
            # perform cross-validation
            scores = cross_val_score(forestModel, X_train_scaled, Y_train, cv = rf_kfolds, scoring = 'accuracy')

            # compute mean cross-validation accuracy
            score = np.mean(scores)

            # if we got a better score, store the score and parameters
            if score > rf_best_score:
                rf_best_score = score
                rf_best_M = M
                rf_best_d = d
                rf_best_m = m

# Rebuild a model on the combined training and validation set        
SelectedRFModel = RandomForestClassifier(n_estimators=rf_best_M, max_features=rf_best_d,max_depth=rf_best_m, random_state=0).fit(X_train_scaled, Y_train )


# Model Test
rf_test_score = SelectedRFModel.score(X_test_scaled, Y_test)

# Predicted Output of Model
PredictedOutput = SelectedRFModel.predict(X_test_scaled)

# Extracting sensitivity & specificity from ROC curve to measure Performance of the model
rf_fpr, rf_tpr, rf_thresholds = roc_curve(Y_test, PredictedOutput, pos_label=1)

# Using AUC of ROC to validate models based on a single score
rf_test_auc = auc(rf_fpr, rf_tpr)


# Output Printing Scores
print("Best accuracy on validation set is:", rf_best_score)

print("Best parameters of M, d, m are: ", rf_best_M, rf_best_d, rf_best_m)

print("Test accuracy with the best parameters is", rf_test_score)
print("Test AUC with the best parameters is:", rf_test_auc)



# Appending results to performance list
m = 'Random Forest'
performance.append([m, rf_test_score, rf_test_auc, rf_fpr, rf_tpr, rf_thresholds])
Best accuracy on validation set is: 0.9317241379310346
Best parameters of M, d, m are:  10 7 7
Test accuracy with the best parameters is 0.8775510204081632
Test AUC with the best parameters is: 0.7918660287081339

AdaBoost Classifier Model

In [31]:
ada_best_score = 0
ada_kfolds = 5


for M in range(2, 15, 2): # combines M trees
    for lr in [0.0001, 0.001, 0.01, 0.1, 1,2,3]:
        # train the model
        boostModel = AdaBoostClassifier(n_estimators=M, learning_rate=lr, random_state=0)

        # perform cross-validation
        scores = cross_val_score(boostModel, X_train_scaled, Y_train, cv = ada_kfolds, scoring = 'accuracy')

        # compute mean cross-validation accuracy
        score = np.mean(scores)

        # if we got a better score, store the score and parameters
        if score > ada_best_score:
            ada_best_score = score
            ada_best_M = M
            ada_best_lr = lr

# Rebuild a model on the combined training and validation set        
SelectedBoostModel = AdaBoostClassifier(n_estimators=ada_best_M, learning_rate=ada_best_lr, random_state=0).fit(X_train_scaled, Y_train )




# Model Test
ada_test_score = SelectedBoostModel.score(X_test_scaled, Y_test)

# Predicted Output of Model
PredictedOutput = SelectedBoostModel.predict(X_test_scaled)

# Extracting sensitivity & specificity from ROC curve to measure Performance of the model
ada_fpr, ada_tpr, ada_thresholds = roc_curve(Y_test, PredictedOutput, pos_label=1)

# Using AUC of ROC to validate models based on a single score
ada_test_auc = auc(ada_fpr, ada_tpr)


# Output Printing Scores
print("Best accuracy on validation set is:", ada_best_score)

print("Best parameter of M is: ", ada_best_M)
print("best parameter of LR is: ", ada_best_lr)

print("Test accuracy with the best parameter is", ada_test_score)
print("Test AUC with the best parameters is:", ada_test_auc)


# Appending results to performance list
m = 'AdaBoost'
performance.append([m, ada_test_score, ada_test_auc, ada_fpr, ada_tpr, ada_thresholds])
Best accuracy on validation set is: 0.9039080459770116
Best parameter of M is:  10
best parameter of LR is:  1
Test accuracy with the best parameter is 0.8775510204081632
Test AUC with the best parameters is: 0.7595693779904306

XGBoost Classifier Model

In [32]:
xgb_best_score = 0
xgb_kfolds = 5

for n in [2,4,6,8,10]: #iterate over the values we need to try for the parameter n_estimators 
    for lr in [1.1,1.2,1.22,1.23,1.3]: #iterate over the values we need to try for the learning rate parameter
        for depth in [2,4,6,8,10]: # iterate over the values we need to try for the depth parameter
            XGB = xgb.XGBClassifier(objective = 'binary:logistic', max_depth=depth, n_estimators=n, learning_rate = lr) #define the model

            # perform cross-validation
            scores = cross_val_score(XGB, X_train_scaled, Y_train, cv = xgb_kfolds, scoring='accuracy')
            # the training set will be split internally into training and cross validation

            # compute mean cross-validation accuracy
            score = np.mean(scores)

            # if we got a better score, store the score and parameters
            if score > xgb_best_score:
                xgb_best_score = score #store the score 
                xgb_best_md = depth #store the parameter maximum depth
                xgb_best_ne = n #store the parameter n_estimators
                xgb_best_lr = lr #store the parameter learning rate


# rebuild a model with best parameters to get score 
XGB_selected = xgb.XGBClassifier(objective = 'binary:logistic',max_depth=xgb_best_md, n_estimators=xgb_best_ne, learning_rate = xgb_best_lr).fit(X_train_scaled, Y_train)


# Model Test
xgb_test_score = XGB_selected.score(X_test_scaled, Y_test)

# Predicted Output of Model
PredictedOutput = XGB_selected.predict(X_test_scaled)

# Extracting sensitivity & specificity from ROC curve to measure Performance of the model
xgb_fpr, xgb_tpr, xgb_thresholds = roc_curve(Y_test, PredictedOutput, pos_label=1)

# Using AUC of ROC to validate models based on a single score
xgb_test_auc = auc(xgb_fpr, xgb_tpr)



# Output Printing Scores
print("Best accuracy on cross validation set is:", xgb_best_score)

print("Best parameter for maximum depth is: ", xgb_best_md)
print("Best parameter for n_estimators is: ", xgb_best_ne)
print("Best parameter for learning rate is: ", xgb_best_lr)

print("Test accuracy with the best parameters is", xgb_test_score)
print("Test AUC with the best parameters is:", xgb_test_auc)


# Appending results to performance list
m = 'XGB'
performance.append([m, xgb_test_score, xgb_test_auc, xgb_fpr, xgb_tpr, xgb_thresholds])
Best accuracy on cross validation set is: 0.93816091954023
Best parameter for maximum depth is:  2
Best parameter for n_estimators is:  10
Best parameter for learning rate is:  1.2
Test accuracy with the best parameters is 0.8979591836734694
Test AUC with the best parameters is: 0.8050239234449762

Ensemble Voting Classifier Model

In [33]:
from sklearn.ensemble import VotingClassifier

# Logistic regression model
clf1 = LogisticRegression(C=lr_best_parameters).fit(X_train_scaled, Y_train)

# SVC Model
clf2 = SVC(C=svm_best_parameter_c, gamma=svm_best_parameter_gamma, kernel=svm_best_parameter_k,probability=True)

# DecisionTreeClassifier model
clf3 = DecisionTreeClassifier(max_depth=dt_best_parameter)

# Random Forest Classifier Model
clf4 = RandomForestClassifier(n_estimators=rf_best_M, max_features=rf_best_d,max_depth=rf_best_m, random_state=42)

# AdaBoostClassifier Model
clf5 = AdaBoostClassifier(n_estimators=ada_best_M, learning_rate=ada_best_lr, random_state=42)

# XGBoost Classifier Model
clf6 = xgb.XGBClassifier(objective = 'binary:logistic',max_depth=xgb_best_md, n_estimators=xgb_best_ne ,learning_rate = xgb_best_lr)



# Defining VotingClassifier
eclf1 = VotingClassifier(estimators=[ ('LogisticRegression', clf1), ('SVC', clf2),('DecisionTree',clf3),('Random Forest', clf4),('ADABoost',clf5),('XGBoost',clf6)], voting='hard')


# Fitting VotingClassifier
eclf1 = eclf1.fit(X_train_scaled, Y_train)


# Model Test
eclf_test_score = eclf1.score(X_test_scaled, Y_test)

# Predicted Output of Model
PredictedOutput = eclf1.predict(X_test_scaled)

# Extracting sensitivity & specificity from ROC curve to measure Performance of the model
eclf_fpr, eclf_tpr, eclf_thresholds = roc_curve(Y_test, PredictedOutput, pos_label=1)

# Using AUC of ROC to validate models based on a single score
eclf_test_auc = auc(eclf_fpr, eclf_tpr)



# Output Printing Scores
print("Test accuracy with the best parameters is", eclf_test_score)
print("Test AUC with the best parameters is:", eclf_test_auc)


# Appending results to performance list
m = 'ECLF'
performance.append([m, eclf_test_score, eclf_test_auc, eclf_fpr, eclf_tpr, eclf_thresholds])
Test accuracy with the best parameters is 0.9183673469387755
Test AUC with the best parameters is: 0.8181818181818181

Model Results for Top performing models


In [34]:
result = pd.DataFrame(performance, columns=['Model', 'Accuracy', 'AUC', 'FPR', 'TPR', 'TH'])
df = result[['Model', 'Accuracy', 'AUC']]
results = df.sort_values('Accuracy',ascending = False)
display(results.style.background_gradient(cmap='Reds',subset=["Accuracy"]).background_gradient(cmap='Greens',subset=["AUC"]))
Model Accuracy AUC
1 SVM 0.938776 0.863636
6 ECLF 0.918367 0.818182
0 Logistic Regression 0.897959 0.772727
2 Decision Tree 0.897959 0.805024
5 XGB 0.897959 0.805024
3 Random Forest 0.877551 0.791866
4 AdaBoost 0.877551 0.759569

Conclusion

  • Best Model Accuracy is SVM Classifier at ~94% accuracy.
  • Best AUC is SVM Classifier at 86% area.
  • Ensemble Voting is at ~92% accuracy and 81% area.
  • Most contributing features are PPE at 35% followed by MDVP:Fo(Hz) at 28% and HNR at 13%.
  • Machine Learning Findings align with EDA.